1 Introduction
Create a model capable of estimating the creditworthiness of customers, in order to help the dedicated team decide whether or not to accept a credit card application.
1.1 Features
- ID: customer identification number
- CODE_GENDER: gender of the customer
- FLAG_OWN_CAR: indicator of car ownership
- FLAG_OWN_REALTY: indicator of house ownership
- CNT_CHILDREN: number of children
- AMT_INCOME_TOTAL: annual income
- NAME_INCOME_TYPE: type of income
- NAME_EDUCATION_TYPE: level of education
- NAME_FAMILY_STATUS: marital status
- NAME_HOUSING_TYPE: type of dwelling
- DAYS_BIRTH: number of days since birth
- DAYS_EMPLOYED: number of days since employment started (a positive value indicates the number of days the person has been unemployed)
- FLAG_MOBIL: indicator for the presence of a mobile phone number
- FLAG_WORK_PHONE: indicator for the presence of a work phone number
- FLAG_PHONE: indicator for the presence of a telephone number
- FLAG_EMAIL: indicator for the presence of an email address
- OCCUPATION_TYPE: type of occupation
- CNT_FAM_MEMBERS: number of family members
- TARGET: variable equal to 1 if the customer has high creditworthiness (consistent payment of installments), 0 otherwise.
If a customer is denied a credit card, the team must be able to give a reason. This means the model must provide indications that are easy to interpret.
It is a binary classification task that requires good explainability.
2 Import
2.1 R
2.2 Python
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
from datetime import date
from dateutil.relativedelta import relativedelta
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler, LabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
2.3 Config class
Code
class Config():
    def __init__(self):
        """
        Configuration class: paths, random state and column groups.
        """
        self.path = "/Users/simonebrazzi/R/blog/posts/credit_score/"
        self.file = "/Users/simonebrazzi/R/blog/posts/credit_score/credit_scoring.csv"
        self.random_state = 42
        self.col_bi = ['code_gender', 'flag_own_car', 'flag_own_realty', 'flag_mobil', 'flag_work_phone', 'flag_phone', 'flag_email']
        self.col_ohe = ["name_income_type", "name_education_type", "name_family_status", "name_housing_type", "occupation_type"]
        self.col_ss = ['cnt_children', 'amt_income_total', 'cnt_fam_members']
        self.features = self.col_bi + self.col_ohe + self.col_ss

config = Config()
3 Dataset
Now we can import the dataset. Since all column names are UPPERCASE, I will set them to lowercase. I will also assign target as the y variable and the remaining columns as x, the features.
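In Python, the import and renaming step might look like the following sketch; a toy frame stands in here for the actual CSV read with pd.read_csv(config.file):

```python
import pandas as pd

# Toy frame standing in for pd.read_csv(config.file)
df = pd.DataFrame({
    "ID": [1, 2],
    "AMT_INCOME_TOTAL": [67500.0, 90000.0],
    "TARGET": [0, 1],
})

df.columns = df.columns.str.lower()  # all column names to lowercase
x = df.drop(columns="target")        # features
y = df["target"]                     # label
```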
4 EDA
4.1 Target
It is always important to perform an EDA. In this case, leveraging domain knowledge, I expect fewer cases of high creditworthiness than low. Considering it is a binary classification, it is important to check whether the data are balanced, in order to make a better model selection.
The target variable:
- 0 has 308,705 observations.
- 1 has 29,722 observations.
Code
df <- py$df
df_g <- df %>%
rename(label = target) %>%
mutate(label = as.factor(label)) %>%
summarize(
freq_abs = n(),
freq_rel = n() / nrow(df),
.by = label
)
df_g %>%
gt() %>%
fmt_auto() %>%
cols_width(
label ~ pct(20),
freq_abs ~ pct(35),
freq_rel ~ pct(35)
) %>%
cols_align(
align = "center",
columns = c(freq_abs, freq_rel)
) %>%
tab_header(
title = "Label frequency",
subtitle = "Absolute and relative frequencies"
) %>%
cols_label(
label = "Label",
freq_abs = "Absolute frequency",
freq_rel = "Relative frequency"
) %>%
tab_options(
table.width = pct(100)
)

| Label frequency | ||
|---|---|---|
| Absolute and relative frequencies | ||
| Label | Absolute frequency | Relative frequency |
| 0 | 308,705 | 0.912 |
| 1 | 29,722 | 0.088 |
Remember: 0 stands for low creditworthiness, 1 for high. Our dataset is strongly unbalanced towards the 0 label, so my first intuition was right.
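The same frequency check can be sketched in pandas; toy data stands in for df["target"]:

```python
import pandas as pd

# Toy target column with a 9:1 imbalance, mirroring the dataset
target = pd.Series([0] * 9 + [1])

freq_abs = target.value_counts()                # absolute frequencies
freq_rel = target.value_counts(normalize=True)  # relative frequencies
```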
4.2 Features
Let's check the NA values in each feature.
Code
df_na <- df %>%
summarize(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(
cols = everything(),
names_to = "columns",
values_to = "value"
)
df_na %>%
gt() %>%
tab_options(
table.width = pct(100)
) %>%
cols_align(align = "center") %>%
cols_label(
columns = "Columns",
value = "NA Value"
) %>%
tab_header(
title = "NA values",
subtitle = "Number of NA value for each feature"
) %>%
fmt_number(
columns = c("value"),
decimals = 2,
drop_trailing_zeros = TRUE,
#drop_trailing_dec_mark = FALSE
)

| NA values | |
|---|---|
| Number of NA value for each feature | |
| Columns | NA Value |
| id | 0 |
| code_gender | 0 |
| flag_own_car | 0 |
| flag_own_realty | 0 |
| cnt_children | 0 |
| amt_income_total | 0 |
| name_income_type | 0 |
| name_education_type | 0 |
| name_family_status | 1 |
| name_housing_type | 1 |
| days_birth | 1 |
| days_employed | 1 |
| flag_mobil | 1 |
| flag_work_phone | 1 |
| flag_phone | 1 |
| flag_email | 1 |
| occupation_type | 103,342 |
| cnt_fam_members | 1 |
| target | 0 |
There are two interesting things to notice:
- The column occupation_type has a lot of NA values. How many in proportion to the total? Should we drop the NAs or keep them?
- Several other columns contain exactly one NA each.
Code
df %>%
summarize(
freq_abs = sum(is.na(occupation_type)),
freq_rel = sum(is.na(occupation_type)) / nrow(df)
) %>%
gt() %>%
tab_options(
table.width = pct(100)
) %>%
cols_align(align = "center") %>%
cols_label(
freq_abs = "Absolute Frequency",
freq_rel = "Relative Frequency"
) %>%
tab_header(
title = "Absolute and Relative Frequency",
subtitle = "NA frequency for occupation_type"
) %>%
fmt_number(
columns = c("freq_abs", "freq_rel"),
decimals = 2,
drop_trailing_zeros = TRUE
)

| Absolute and Relative Frequency | |
|---|---|
| NA frequency for occupation_type | |
| Absolute Frequency | Relative Frequency |
| 103,342 | 0.31 |
30.54% of occupation_type values are NA. This could be a big concern: how should we handle it? Dropping them means losing a lot of data, while keeping them could lead to imputing wrong values, which could be worse.
Let's check the frequency of the occupation_type labels.
Code
df %>%
summarise(
n = n(),
.by = occupation_type
) %>%
arrange(desc(n)) %>%
gt() %>%
tab_options(
table.width = pct(100)
) %>%
cols_align(align = "center") %>%
cols_label(
occupation_type = "Occupation Type",
n = "Frequency"
) %>%
tab_header(
title = "Occupation Type Label Frequency"
) %>%
fmt_number(
columns = c("n"),
decimals = 2,
drop_trailing_zeros = TRUE
)

| Occupation Type Label Frequency | |
|---|---|
| Occupation Type | Frequency |
| NA | 103,342 |
| Laborers | 60,146 |
| Core staff | 33,527 |
| Sales staff | 31,652 |
| Managers | 27,384 |
| Drivers | 20,020 |
| High skill tech staff | 13,399 |
| Accountants | 12,281 |
| Medicine staff | 10,438 |
| Cooking staff | 6,248 |
| Security staff | 6,218 |
| Cleaning staff | 4,594 |
| Private service staff | 2,787 |
| Low-skill Laborers | 1,714 |
| Secretaries | 1,577 |
| Waiters/barmen staff | 1,245 |
| Realty agents | 852 |
| HR staff | 567 |
| IT staff | 436 |
So, here is the situation: we could impute the NAs respecting the proportions of the other labels in the dataset, or we could drop them. I choose to drop the NAs: even if we lose a third of the data, we make no assumptions about the rest. We still have 235,085 observations, which is plenty.
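In pandas, the drop might be sketched as follows (a toy frame stands in for the full dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the full dataset
df = pd.DataFrame({
    "occupation_type": ["Laborers", np.nan, "Managers"],
    "target": [0, 1, 0],
})

# Drop only the rows where occupation_type is NA
df_clean = df.dropna(subset=["occupation_type"])
```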
Code
df_occ_type <- df %>%
filter(!is.na(occupation_type)) %>%
summarise(
n = n(),
.by = occupation_type
) %>%
arrange(desc(n))
# to extend the palette to the number of labels
extended_palette <- colorRampPalette(RColorBrewer::brewer.pal(9, "Purples"))(df_occ_type %>% nrow())
g <- ggplot(df_occ_type) +
geom_col(aes(x = n, y = reorder(occupation_type, n), fill = reorder(occupation_type, n))) +
theme_minimal() +
labs(
x = "Count",
y = "Labels",
title = "Occupation Type Label Frequency"
) +
scale_fill_manual(values = extended_palette)
ggplotly(g)
There is at least one NA value in almost every column: is it possible that it is the same observation?
First of all, we can create a list of the column names which have only one NA, using the df_na tibble.
[1] "name_family_status" "name_housing_type" "days_birth"
[4] "days_employed" "flag_mobil" "flag_work_phone"
[7] "flag_phone" "flag_email" "cnt_fam_members"
Then, to check multiple OR conditions, we can use if_any. In this case, we are checking whether any of the columns with exactly one NA are NA in the same row.
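For comparison, a pandas sketch of the same any-NA row check; the toy data and the shortened one_cols list are hypothetical stand-ins for the real frame and the list built above:

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the list of columns with exactly one NA
one_cols = ["name_family_status", "days_birth"]

df = pd.DataFrame({
    "id": [1, 2, 3],
    "name_family_status": ["Married", np.nan, "Single / not married"],
    "days_birth": [-12000.0, np.nan, -9000.0],
})

# Rows where ANY of the listed columns is NA (pandas counterpart of if_any)
suspect = df[df[one_cols].isna().any(axis=1)]
```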
Code
| id | code_gender | flag_own_car | flag_own_realty | cnt_children | amt_income_total | name_income_type | name_education_type | name_family_status | name_housing_type | days_birth | days_employed | flag_mobil | flag_work_phone | flag_phone | flag_email | occupation_type | cnt_fam_members | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6392180 | F | N | N | 0 | 67500 | Working | Secondary / secondary special | NA | NA | NaN | NaN | NaN | NaN | NaN | NaN | NA | NaN | 0 |
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
# Was:
data %>% select(one_cols)
# Now:
data %>% select(all_of(one_cols))
See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
4.3 Interesting features
Let's also look at some interesting variables: click on the different tabs to see the frequency table for each of them.
Code
count_results$cnt_children %>%
arrange(desc(n)) %>%
gt() %>%
tab_options(
table.width = pct(100)
) %>%
cols_align(align = "center") %>%
cols_label(
cnt_children = "# Children",
n = "Count"
) %>%
tab_header(
title = "Count of Number of Children",
) %>%
fmt_number(
columns = c("n"),
decimals = 2,
drop_trailing_zeros = TRUE,
#drop_trailing_dec_mark = FALSE
)

| Count of Number of Children | |
|---|---|
| # Children | Count |
| 0 | 148,909 |
| 1 | 56,238 |
| 2 | 26,196 |
| 3 | 3,319 |
| 4 | 300 |
| 5 | 109 |
| 9 | 4 |
| 12 | 4 |
| 14 | 3 |
| 7 | 2 |
| 19 | 1 |
Code
count_results$name_education_type %>%
arrange(desc(n)) %>%
gt() %>%
tab_options(
table.width = pct(100)
) %>%
cols_align(align = "center") %>%
cols_label(
name_education_type = "Education Type",
n = "Count"
) %>%
tab_header(
title = "Count by Education Type",
) %>%
fmt_number(
columns = c("n"),
decimals = 2,
drop_trailing_zeros = TRUE,
#drop_trailing_dec_mark = FALSE
)

| Count by Education Type | |
|---|---|
| Education Type | Count |
| Secondary / secondary special | 157,822 |
| Higher education | 66,693 |
| Incomplete higher | 8,799 |
| Lower secondary | 1,609 |
| Academic degree | 162 |
Code
count_results$name_family_status %>%
arrange(desc(n)) %>%
gt() %>%
tab_options(
table.width = pct(100)
) %>%
cols_align(align = "center") %>%
cols_label(
name_family_status = "Family Status",
n = "Count"
) %>%
tab_header(
title = "Count by Family Status",
) %>%
fmt_number(
columns = c("n"),
decimals = 2,
drop_trailing_zeros = TRUE,
#drop_trailing_dec_mark = FALSE
)

| Count by Family Status | |
|---|---|
| Family Status | Count |
| Married | 164,310 |
| Single / not married | 30,551 |
| Civil marriage | 20,968 |
| Separated | 14,108 |
| Widow | 5,148 |
Code
count_results$name_housing_type %>%
arrange(desc(n)) %>%
gt() %>%
tab_options(
table.width = pct(100)
) %>%
cols_align(align = "center") %>%
cols_label(
name_housing_type = "House Type",
n = "Count"
) %>%
tab_header(
title = "Count by House Type",
) %>%
fmt_number(
columns = c("n"),
decimals = 2,
drop_trailing_zeros = TRUE,
#drop_trailing_dec_mark = FALSE
)

| Count by House Type | |
|---|---|
| House Type | Count |
| House / apartment | 208,980 |
| With parents | 11,938 |
| Municipal apartment | 7,411 |
| Rented apartment | 3,549 |
| Office apartment | 2,270 |
| Co-op apartment | 937 |
Code
count_results$cnt_fam_members %>%
arrange(desc(n)) %>%
gt() %>%
tab_options(
table.width = pct(100)
) %>%
cols_align(align = "center") %>%
cols_label(
cnt_fam_members = "Number of Family Members",
n = "Count"
) %>%
tab_header(
title = "Count by Number of Family Members",
) %>%
fmt_number(
columns = c("n"),
decimals = 2,
drop_trailing_zeros = TRUE,
#drop_trailing_dec_mark = FALSE
)

| Count by Number of Family Members | |
|---|---|
| Number of Family Members | Count |
| 2 | 118,852 |
| 3 | 49,232 |
| 1 | 38,902 |
| 4 | 24,556 |
| 5 | 3,138 |
| 6 | 282 |
| 7 | 109 |
| 11 | 4 |
| 14 | 4 |
| 15 | 3 |
| 9 | 2 |
| 20 | 1 |
Code
count_results$flag_own_car %>%
arrange(desc(n)) %>%
gt() %>%
tab_options(
table.width = pct(100)
) %>%
cols_align(align = "center") %>%
cols_label(
flag_own_car = "Own Car",
n = "Count"
) %>%
tab_header(
title = "Count by Own a Car",
) %>%
fmt_number(
columns = c("n"),
decimals = 2,
drop_trailing_zeros = TRUE,
#drop_trailing_dec_mark = FALSE
)

| Count by Own a Car | |
|---|---|
| Own Car | Count |
| N | 138,203 |
| Y | 96,882 |
Code
count_results$code_gender %>%
arrange(desc(n)) %>%
gt() %>%
tab_options(
table.width = pct(100)
) %>%
cols_align(align = "center") %>%
cols_label(
code_gender = "Gender",
n = "Count"
) %>%
tab_header(
title = "Count by Gender",
) %>%
fmt_number(
columns = c("n"),
decimals = 2,
drop_trailing_zeros = TRUE,
#drop_trailing_dec_mark = FALSE
)

| Count by Gender | |
|---|---|
| Gender | Count |
| F | 147,505 |
| M | 87,580 |
5 Data preparation
From the EDA, we know we can drop:
- the rows with NA occupation_type.
- the observation with id 6392180.
It could be useful to have the birth and employment dates as proper dates rather than integers. We will check later whether they are really useful.
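A sketch of that conversion, assuming days_birth and days_employed are negative day offsets counted back from today (consistent with the feature descriptions):

```python
from datetime import date

import pandas as pd

# Toy values; in the dataset these are negative day offsets
df = pd.DataFrame({"days_birth": [-12000], "days_employed": [-3000]})

today = pd.Timestamp(date.today())
# Adding a negative day offset to today yields the actual calendar date
df["birth_date"] = today + pd.to_timedelta(df["days_birth"], unit="D")
df["employed_date"] = today + pd.to_timedelta(df["days_employed"], unit="D")
```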
6 Split
To account for the class imbalance and preserve it in our split, we can use stratified sampling.
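With scikit-learn, stratification is the stratify argument of train_test_split; a minimal sketch on toy data with the same 90/10 imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with a 90/10 class imbalance, mirroring the dataset
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y preserves the class proportions in both splits
xtrain, xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```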
7 Models
7.1 DecisionTreeClassifier
To address model explainability, we can choose a CART model. This choice has two interesting points for our task:
- Explainability, as already mentioned. This will help us explain the reason for a rejection.
- Feature selection. At this link you can find a paper which shows how embedded methods such as tree models can achieve the same performance as traditional feature selection methods.
7.1.1 GridSearchCV
Even if the results of the model are relatively easy to understand, the hyperparameter tuning has far too many possibilities to explore by hand. This is where GridSearchCV() comes in handy.
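The exact param_grid is not shown in the post; a plausible reconstruction, inferred from the 18 parameter combinations in the cross-validation table below, might look like:

```python
# Hypothetical reconstruction of param_grid from the cv results table:
# 3 criteria x 3 depths x 1 max_features x 2 splitters = 18 combinations.
param_grid = {
    "classifier__criterion": ["gini", "entropy", "log_loss"],
    "classifier__max_depth": [2, 5, 10],
    "classifier__max_features": [15],
    "classifier__splitter": ["best", "random"],
}

# Number of candidate models the grid search will evaluate
n_combinations = 1
for values in param_grid.values():
    n_combinations *= len(values)
```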
7.1.2 Pipeline
One of the many nice sklearn features is the Pipeline(). It has several important benefits:
- Less code redundancy and better reproducibility: it is like a recipe you can tune and run without worrying about putting each step together in the correct order.
- Avoiding data leakage: the preprocessing steps are fit only on the training data.
- Hyperparameter tuning: the Pipeline() integrates with GridSearchCV() and RandomizedSearchCV().
- Modularity: it makes it easy to tune individual steps, or to remove and add new ones.
First, initialize the various preprocessors and the CART model.
Using the ColumnTransformer(), we can apply the different preprocessing steps to specific columns. The preprocessing is divided between:
- Encoding the binary labels to 0-1 using OneHotEncoder(drop="if_binary").
- Encoding the remaining categorical labels using OneHotEncoder(). I chose this over other categorical encodings because it avoids imposing a hierarchy.
- Standardizing the numerical variables using StandardScaler(). I also instantiated a MinMaxScaler(), but I am not using it.
The classifier is a DecisionTreeClassifier(), which the GridSearchCV() will tune.
All of these steps are put together with the Pipeline().
For the binary labels I had to use OneHotEncoder(drop="if_binary"), otherwise the other preprocessors would not work with the ColumnTransformer().
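The steps above can be sketched as follows; this is a hedged reconstruction (the exact pipe_dtc is not shown), with shortened column lists standing in for those in the Config class:

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Shortened stand-ins for config.col_bi, config.col_ohe, config.col_ss
col_bi = ["code_gender", "flag_own_car"]
col_ohe = ["name_income_type"]
col_ss = ["cnt_children", "amt_income_total"]

preprocessor = ColumnTransformer(
    transformers=[
        # Binary labels -> a single 0/1 column each
        ("binary", OneHotEncoder(drop="if_binary"), col_bi),
        # Remaining categoricals -> one-hot, no implied hierarchy
        ("categorical", OneHotEncoder(handle_unknown="ignore"), col_ohe),
        # Numerical variables -> standardized
        ("numeric", StandardScaler(), col_ss),
    ]
)

pipe_dtc = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(random_state=42)),
])
```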
7.1.3 Fit
Code
grid_search_dtc = GridSearchCV(
estimator=pipe_dtc,
param_grid=param_grid,
cv=7,
scoring=scoring,
refit="accuracy", # This will refit the model using the accuracy metric
n_jobs=-1 # use all the available CPU
)
grid_search_dtc.fit(xtrain, ytrain)
# save the model as pkl file
with open(config.path + "grid_search_dtc.pkl", "wb") as file:
    pickle.dump(grid_search_dtc, file)
Code
py$cv_results %>%
arrange(across(contains("rank"))) %>%
select(params, mean_test_accuracy, rank_test_accuracy, mean_test_f1, rank_test_f1, mean_test_roc_auc, rank_test_roc_auc) %>%
gt() %>%
tab_header(
title = "Cross Validation Results",
subtitle = "Main metrics"
) %>%
fmt_number(
columns = c("mean_test_accuracy", "mean_test_f1", "mean_test_roc_auc"),
decimals = 3,
drop_trailing_zeros = TRUE,
drop_trailing_dec_mark = FALSE
) %>%
cols_align(
align = "center"
) %>%
cols_label(
params = "criterion - max_depth - max_features - splitter",
mean_test_accuracy = "Mean Test Accuracy",
rank_test_accuracy = "Rank Test Accuracy",
mean_test_f1 = "Mean Test F1",
rank_test_f1 = "Rank Test F1",
mean_test_roc_auc = "Mean Test ROC AUC",
rank_test_roc_auc = "Rank Test ROC AUC",
) %>%
tab_options(
table.width = pct(100)
)

| Cross Validation Results | ||||||
|---|---|---|---|---|---|---|
| Main metrics | ||||||
| criterion - max_depth - max_features - splitter | Mean Test Accuracy | Rank Test Accuracy | Mean Test F1 | Rank Test F1 | Mean Test ROC AUC | Rank Test ROC AUC |
| log_loss, 2, 15, random | 0.727 | 1 | 0.2 | 18 | 0.56 | 17 |
| gini, 10, 15, best | 0.707 | 2 | 0.403 | 1 | 0.863 | 2 |
| log_loss, 10, 15, best | 0.704 | 3 | 0.4 | 2 | 0.863 | 1 |
| entropy, 10, 15, best | 0.701 | 4 | 0.394 | 3 | 0.853 | 3 |
| gini, 2, 15, best | 0.677 | 5 | 0.31 | 8 | 0.705 | 11 |
| entropy, 5, 15, best | 0.667 | 6 | 0.368 | 4 | 0.82 | 4 |
| log_loss, 5, 15, best | 0.656 | 7 | 0.365 | 5 | 0.819 | 5 |
| gini, 5, 15, best | 0.651 | 8 | 0.348 | 6 | 0.797 | 6 |
| log_loss, 10, 15, random | 0.605 | 9 | 0.296 | 9 | 0.741 | 7 |
| entropy, 10, 15, random | 0.602 | 10 | 0.274 | 12 | 0.706 | 10 |
| log_loss, 2, 15, best | 0.591 | 11 | 0.31 | 7 | 0.715 | 9 |
| gini, 10, 15, random | 0.583 | 12 | 0.277 | 11 | 0.722 | 8 |
| entropy, 5, 15, random | 0.56 | 13 | 0.231 | 15 | 0.622 | 15 |
| entropy, 2, 15, best | 0.543 | 14 | 0.285 | 10 | 0.677 | 12 |
| gini, 5, 15, random | 0.541 | 15 | 0.234 | 14 | 0.633 | 14 |
| log_loss, 5, 15, random | 0.515 | 16 | 0.246 | 13 | 0.65 | 13 |
| gini, 2, 15, random | 0.513 | 17 | 0.207 | 16 | 0.567 | 16 |
| entropy, 2, 15, random | 0.341 | 18 | 0.203 | 17 | 0.549 | 18 |
The grid search, refit on accuracy, returns a CART with a best mean test accuracy of 0.727 (the rank-1 row in the table above) with the following parameters:
- criterion: log_loss
- max_depth: 2
- max_features: 15
- splitter: random
Now we can save and load the model. This is useful to avoid re-running the grid search.
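Loading back the pickled grid search mirrors the dump above; a minimal self-contained sketch, where a plain dict stands in for the fitted object and the temp-dir path is illustrative:

```python
import os
import pickle
import tempfile

# Stand-in for the fitted grid search object
obj = {"criterion": "log_loss", "max_depth": 2}

# Illustrative path; the post uses config.path + "grid_search_dtc.pkl"
path = os.path.join(tempfile.gettempdir(), "grid_search_dtc.pkl")

with open(path, "wb") as f:
    pickle.dump(obj, f)   # save

with open(path, "rb") as f:
    loaded = pickle.load(f)  # load back
```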
7.1.4 Predict
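The prediction code is not shown. One hedged sketch, with a small stand-in classifier in place of the fitted grid search, is thresholding predict_proba to favor recall, which matches the "threshold optimization favoring recall" subtitle of the report below; the 0.3 threshold here is purely hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data and a stand-in classifier for the fitted grid search
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 1).astype(int)  # imbalanced toy labels
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# Lowering the threshold below 0.5 trades precision for recall
threshold = 0.3  # hypothetical value
proba = clf.predict_proba(X)[:, 1]
ypred = (proba >= threshold).astype(int)
```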
7.1.5 Classification Report
Code
library(reticulate)
df_cr <- py$df_cr %>% dplyr::rename(names = index)
cols <- df_cr %>% colnames()
df_cr %>%
pivot_longer(
cols = -names,
names_to = "metrics",
values_to = "values"
) %>%
pivot_wider(
names_from = names,
values_from = values
) %>%
gt() %>%
tab_header(
title = "Classification Report",
subtitle = "Threshold optimization favoring recall"
) %>%
fmt_number(
columns = c("precision", "recall", "f1-score", "support"),
decimals = 2,
drop_trailing_zeros = TRUE,
drop_trailing_dec_mark = FALSE
) %>%
cols_align(
align = "center",
columns = c("precision", "recall", "f1-score", "support")
) %>%
cols_align(
align = "left",
columns = metrics
) %>%
cols_label(
metrics = "Metrics",
precision = "Precision",
recall = "Recall",
`f1-score` = "F1-Score",
support = "Support"
) %>%
tab_options(
table.width = pct(100)
)

| Classification Report | ||||
|---|---|---|---|---|
| Threshold optimization favoring recall | ||||
| Metrics | Precision | Recall | F1-Score | Support |
| 0 | 0.94 | 0.48 | 0.63 | 41,986 |
| 1 | 0.15 | 0.76 | 0.25 | 5,031 |
| accuracy | 0.51 | 0.51 | 0.51 | 0.51 |
| macro avg | 0.55 | 0.62 | 0.44 | 47,017 |
| weighted avg | 0.86 | 0.51 | 0.59 | 47,017 |